Remove initial_change when dealing with table updates#950
Remove initial_change when dealing with table updates#950kevinjqliu wants to merge 1 commit intoapache:mainfrom
initial_change when dealing with table updates#950Conversation
pyiceberg/table/__init__.py
Outdated
|
|
||
|
|
||
| @_apply_table_update.register(RemoveSnapshotRefUpdate) | ||
| def _(update: RemoveSnapshotRefUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: |
There was a problem hiding this comment.
Hi @kevinjqliu, I've already implemented _apply_table_update for RemoveSnapshotRefUpdate and the public facing remove branch / tag apis in #822, which is waiting on #758.
If you need to add this functionality urgently, please feel free to use the code in the linked PRs.
There was a problem hiding this comment.
Cool! Thanks for the heads up. I can rebase once those PRs are in.
There was a problem hiding this comment.
Do you think it'll be helpful to enforce any new TableUpdate class to have a corresponding update function?
Like in #952
There was a problem hiding this comment.
I think that would helpful! In case a feature is not implemented, like RemoveSnapshotRefUpdate and RemoveSnapshotsUpdate, the test should atleast print a message saying so.
It would help keep track of which features are / are not implemented, and won't be surprise to the end user or us.
3bb0bbf to
90bacbe
Compare
initial_change when dealing with table updates
|
@HonahX do you mind taking a look at this when you get a chance? |
90bacbe to
02d46d5
Compare
HonahX
left a comment
There was a problem hiding this comment.
Sorry for being late. Thanks for the PR! The initial_change was a workaround when createTableTransaction was added. It will be great if we can find a better way to handle the case without this additional parameter.
| action: Literal["add-spec"] = Field(default="add-spec") | ||
| spec: PartitionSpec | ||
|
|
||
| initial_change: bool = Field(default=False, exclude=True) |
There was a problem hiding this comment.
Instead of removing it directly, shall we go through a deprecation process given this is a public class? We could add a deprecation message (via field validator?) when this field is set explicitly.
| context.add_update(update) | ||
| if update.spec.spec_id == INITIAL_PARTITION_SPEC_ID: | ||
| # no op | ||
| return base_metadata |
There was a problem hiding this comment.
This seems to cause problem if I want to create a partitioned table from beginning. For example,
iceberg_schema = Schema(*[NestedField(field_id=1, name="a", field_type=StringType())])
iceberg_spec = PartitionSpec(*[PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name='test1')])
sort_order = SortOrder(*[SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC)])
txn = catalog.create_table_transaction(identifier=identifier, schema=iceberg_schema, partition_spec=iceberg_spec, sort_order=sort_order)
txn.commit_transaction()
tbl = catalog.load_table(identifier)
print("=====Schemas====")
print(tbl.schemas())
print("=====Specs====")
print(tbl.specs())
print("=====SortOrders====")
print(tbl.sort_orders())
=====Schemas====
{0: Schema(NestedField(field_id=1, name='a', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])}
=====Specs====
{0: PartitionSpec(spec_id=0)}
=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}Although iceberg_spec is given, the table is still created with UNPARTITIONED_PARTITION_SPEC
There was a problem hiding this comment.
Thanks for the example.
on a meta level, this is the type of bug I'm afraid of when refactoring... how can we ensure other cases like this are captured
There was a problem hiding this comment.
I believe adding more tests for this specific case would be helpful. While we have extensive coverage for the logic of updating existing metadata, there are very few tests for create_table_transaction, where updates are applied to an empty metadata set to generate the entire metadata. Since this logic is unique to create_table_transaction, errors related to it are not detected by the current tests.
| context.add_update(update) | ||
| if update.sort_order == UNSORTED_SORT_ORDER: | ||
| # no op | ||
| return base_metadata |
There was a problem hiding this comment.
As shown in the example above, if I specify a SortOrder in the beginning, I end up getting a table with an additional empty SortOrder (UNSORTED)
=====SortOrders====
{0: SortOrder(order_id=0), 1: SortOrder(SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), order_id=1)}|
Hi @kevinjqliu I ran a few experiments and found that removing I’ve implemented this approach and created a draft PR: #1219. I’d love to hear your thoughts on it! |
|
Closing this in favor of #1219 |
Closes #864
Identified in #864,
TableMetadatais initialized with the default Pydantic object for schema, partition_spec, and sort_order, which does not play well with table updates. Specifically, theinitial_changefield is an implementation detail of pyiceberg and does not play well when interacting with the REST API. Table update objects from the REST API does not understand this field.We can safely remove
initial_changeby modifying the logic for dealing with table updates.